Skip to content

Conversation

@MattWhelan
Copy link

Extends the existing retry limit mechanism to allow for delays between retries. Includes implementations for fixed delays and exponential backoff.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@MattWhelan MattWhelan force-pushed the backoff branch 2 times, most recently from 3034fdf to d7039d1 Compare June 10, 2025 15:32
@MattWhelan MattWhelan requested a review from sean- June 10, 2025 16:01
@dikshant
Copy link

We haven’t implemented this because cockroachdb-go uses save point based retries and because of savepoint-based retries, a transaction can continue holding locks in between retries. So that means with exponential backoff, you'll be holding locks for longer, thus putting the app at even greater risk of contention.

If you could show us that this is not the case with backoffs, we will consider adding this! Thanks!

@MattWhelan
Copy link
Author

We haven’t implemented this because cockroachdb-go uses save point based retries and because of savepoint-based retries, a transaction can continue holding locks in between retries. So that means with exponential backoff, you'll be holding locks for longer, thus putting the app at even greater risk of contention.

If you could show us that this is not the case with backoffs, we will consider adding this! Thanks!

I just pushed 9dda97b. Could you take a look at that? I'm not sure what all the implications are of separating the database transaction lifecycle from the Tx object, but aside from that, I think it's sound, and avoids the problem with holding locks during backoff.

@MattWhelan MattWhelan requested a review from rafiss June 23, 2025 20:05
}

// Vargo converts a go-retry style Delay provider into a RetryPolicy
func Vargo(fn func() VargoBackoff) RetryPolicy {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we could think of a different name to go with here. Users of cockroach-go who don't already know about the sethvargo/go-retry library may be confused to see this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to suggestions. I was trying to build an explicit integration without adding a transitive dependency.

if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
}
if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could have issues. Some libraries may not like it if we use the transaction after rolling it back. For example, I see this in the pgx code where it returns an error if you try to use a transaction that was already rolled back: https://github.com/jackc/pgx/blob/ecc9203ef42fbba50507e773901b5aead75288ef/tx.go#L205-L224

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns an error if you use a transaction object that has the closed flag set. Since we're not doing that, it ought to be ok, I would expect.

Matt Whelan added 2 commits October 21, 2025 14:33
Extends the existing retry limit mechanism to allow for delays between
retries. Includes implementations for fixed delays and exponential
backoff.
crdb/tx.go Outdated
// WithMaxRetries configures context so that ExecuteTx retries tx specified
// number of times when encountering retryable errors.
// Setting retries to 0 will retry indefinitely.
// Setting retries to 0 will not retry: the transaction will be tried only once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try to avoid breaking the API here. This changes the behavior of WithMaxRetries(ctx, 0) from "retry indefinitely" to "no retries," which would affect any existing code relying on this.

Maybe we could add a separate WithNoRetries() function, and change something about the internal representation so that the outward facing API remains the same.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Done.

if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
}
if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still doesn't seem valid to me -- if we reuse the same tx object here, it will not succeed when sending BEGIN right after a ROLLBACK. (see https://github.com/jackc/pgx/blob/ecc9203ef42fbba50507e773901b5aead75288ef/tx.go#L205-L224)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that we not calling tx.Rollback, but rather tx.Exec(ctx, "ROLLBACK"). Please check the comments in the code, and let me know if that makes sense.

return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)
if delay > 0 && retryErr == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow this. Why are we doing a retry if retryErr is nil?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added documentation on the RetryFunc function definition to clarify the semantics. Implementation of RetryFunc may return errors (typically MaxRetriesExceededError) to stop the retry process.

crdb/common.go Outdated
if delay > 0 && retryErr == nil {
// We don't want to hold locks while waiting for a backoff, so restart the entire transaction
if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong error to return here -- as per this line, when this error is displayed, it will say that there was an issue with ROLLBACK TO SAVEPOINT, which is not what this line is doing:

const msgPattern = "restarting txn failed. ROLLBACK TO SAVEPOINT " +

(This is also the wrong error to return on lines 98 and 101.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Passing the operation to newTxnRestartError.

}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also have tests for edge cases like zero BaseDelay, negative RetryLimit, and MaxDelay < BaseDelay.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if rollbackErr := tx.Exec(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rollbackErr != nil {
return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before we enter the retry logic, we should check for context cancellation:

if err := ctx.Err(); err != nil {
  return err
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

NewRetry() RetryFunc
}

type LimitBackoffRetryPolicy struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is public-facing so should have a comment explaining what it is and how to use it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

crdb/retry.go Outdated
}

// ExpBackoffRetryPolicy implements RetryPolicy using an exponential backoff with optional
// saturation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow what "saturation" means here. Can we write this comment to explain more how to use this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, reworded comment.

actualDelays := make([]time.Duration, 0, len(expectedDelays))
rf := policy.NewRetry()
for {
delay, err := rf(nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're always passing a nil error here -- can we add tests that check the behavior with a non-nil error?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Document LimitBackoffRetryPolicy, ExpBackoffRetryPolicy, and Vargo adapter
with detailed examples.
Preserve backward compatibility by making WithMaxRetries(ctx, 0) mean
unlimited retries (original behavior).
Add WithNoRetries() for disabling retries and introduce sentinel constants.

Enhance RetryFunc documentation to clarify return value semantics and
add additional test cases.
Copy link

@sravotto sravotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Please take another look (will squash the commits once we are done).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss and @sean-)

if rollbackErr := tx.Exec(ctx, "ROLLBACK TO SAVEPOINT cockroach_restart"); rollbackErr != nil {
return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return newTxnRestartError(rollbackErr, err)
// We have a retryable error. Check the retry policy.
delay, retryErr := retryFunc(err)
if delay > 0 && retryErr == nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added documentation on the RetryFunc function definition to clarify the semantics. Implementation of RetryFunc may return errors (typically MaxRetriesExceededError) to stop the retry process.

crdb/common.go Outdated
if delay > 0 && retryErr == nil {
// We don't want to hold locks while waiting for a backoff, so restart the entire transaction
if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Passing the operation to newTxnRestartError.

if restartErr := tx.Exec(ctx, "ROLLBACK"); restartErr != nil {
return newTxnRestartError(restartErr, err)
}
if restartErr := tx.Exec(ctx, "BEGIN"); restartErr != nil {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that we not calling tx.Rollback, but rather tx.Exec(ctx, "ROLLBACK"). Please check the comments in the code, and let me know if that makes sense.

NewRetry() RetryFunc
}

type LimitBackoffRetryPolicy struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

crdb/retry.go Outdated
}

// ExpBackoffRetryPolicy implements RetryPolicy using an exponential backoff with optional
// saturation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, reworded comment.

actualDelays := make([]time.Duration, 0, len(expectedDelays))
rf := policy.NewRetry()
for {
delay, err := rf(nil)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

crdb/tx.go Outdated
// WithMaxRetries configures context so that ExecuteTx retries tx specified
// number of times when encountering retryable errors.
// Setting retries to 0 will retry indefinitely.
// Setting retries to 0 will not retry: the transaction will be tried only once.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants